I.   INTRODUCTION

The last decade has seen the general acceptance of the importance
of syntax description and syntactical analysis in the development of,
and support for, new automatic programming languages for digital
computers.

Since the introduction of the stored program computer, communication
between man and his computer has been a problem. The laborious and
time-consuming numeric coding of machine language inhibited many
potential computer users from learning how to communicate directly with
the computer. The man with a computer-solvable problem often found
manual calculations easier than trying to communicate with either the
expert programmer or the machine. A means of communication with the
computer, more easily learned than machine language, was needed. This
led to the development of problem-oriented languages.

Problem-oriented languages like ALGOL, PL/1, and FORTRAN were
designed to facilitate communication between user and computer for
solutions to mathematical and other special-interest problems.

These automatic programming languages generally consisted of a
vocabulary which incorporated many key words from a profession or an
interest area. The user could instruct the computer in a language
similar to ordinary English usage. For example, instead of trying
to manipulate an array by laboriously coding a sequence of instructions
to achieve the transpose of a matrix, the user could achieve the same
goal by the use of a special reserved word, such as "TRANSPOSE". It
should be noted, however, that these special reserved words, such as
"TRANSPOSE", have a specific, or nonredundant, meaning. The programmer
must use these words in accordance with the coding restrictions of the
particular language in which he is programming.

The  advent  of  the  automatic  programming  languages  attracted  more 
users  to  the  computer.   These  new  users,  as  they  discovered  more 
applications  for  the  computer,  requested  more  languages  specifically 
designed  for  their  particular  interests.   Computer  specialists 
responded  by  developing  more  new  languages. 

The  introduction  of  time  sharing  systems,  with  many  terminals, 
magnified  the  problem  of  satisfying  many  users  and  their  programming 
language  requirements.   Whereas  previously  only  the  large  corporations 
could  afford  a  computer  installation,  now  many  small  businesses  were 
able  to  share  a  central  installation.   The  over-all  effect  provided 
a  large  body  of  users,  solving  computer  problems  with  a  wide  variety 
of  special  purpose  languages. 

The  cost  of  these  time  sharing  systems  is  distributed  among  the 
users.   Each  user  is  charged  for  the  amount  of  computer  processor 
time  used  by  his  program.   Obviously,  the  user  is  interested  in 
minimizing  his  processor  time,  thus  decreasing  costs  and  increasing 
profits.   However,  many  of  the  users  of  time  sharing  systems  are  not 
completely  familiar  with  the  restrictions  of  the  particular  language 
in  use.   As  a  result,  they  use  the  computer  to  "trouble  shoot"  their 
programs  for  syntax  errors.   An  experienced  programmer  finds  that 
programs seldom compile correctly on the first attempt. Yet compilation
of entire programs is attempted on each occasion with the
associated  consumption  of  expensive  processor  time.   The  need  is 
apparent  for  a  tool  that  will  assist  the  programmer  in  eliminating 
syntax  errors,  while  minimizing  expensive  processor  time. 

A by-product of reducing the number of attempts to compile is
the additional processor time available. This time can be used
either to improve response time to the current users or to provide
service to new users.

Thus the objective of this thesis: to provide a basic universal
syntax checking program which could be expanded for use in a
time-sharing environment. This program is hereafter referred to as the
Universal Syntax Checker. This syntax checking program accepts the
description of any language and provides a syntax check of programs
written in the described language, as shown in Figure 1-1. The
syntax checker requires less facility resources than a language
processor, generates no code, and is intended for use in a
time-sharing environment where the various users may time-share the
universal syntax checker regardless of the language in use at each
terminal.


[Diagram not reproduced; surviving box labels: "Syntax", "Program".]

Figure 1-1.   Universal Syntax Checker

II.   DEFINITIONS

A.   SYNTAX  AND  SEMANTICS 

In  order  to  understand  the  operation  of  the  universal  syntax  checker, 
a  knowledge  of  the  methods  used  to  formulate  and  describe  languages  is 
necessary. 

There is a lack of standard uniform notation throughout the
literature; therefore the definitions and notation from [1] and [2] are
selected for use in this discussion.

The  problem  of  describing  a  language  involves  a  unique  difficulty. 
During  the  discussion,  a  distinction  must  be  made  between  the  language 
being  described  and  the  describing  language.   If  the  language  being 
described  is  called  the  "language",  then  the  language  in  terms  of 
which  the  description  is  being  made  is  called  the  "metalanguage". 
If  these  two  languages  are  not  distinguishable,  much  ambiguity  and 
imprecision  results.   A  discussion  of  how  languages  are  described  is 
in  order. 

In  the  exposition  of  automatic  programming  languages,  the 
descriptions themselves can be classified into two types: syntactic
definitions  and  semantic  discussions.   The  syntactic  definitions  make 
explicit  what  structures  are  to  be  meaningful  in  the  language.   These 
definitions  are  concerned  with  the  proper  construction  of  words, 
expressions,  and  statements.   The  semantic  discussions  are  concerned 
with  the  meanings  to  be  attributed  to  the  various  structures  and  their 
proper  usage  in  the  language. 

It often happens that in the process of describing a new computer
language, the formats for statements are presented in terms of the natural
language. These formats define which components are necessary and how
they are to be put together to form meaningful statements. These are
the syntax and semantics of the language.

The syntax of a FORTRAN DO statement can be given as the natural
language definition:

     DO n i = $m_1$, $m_2$, $m_3$, where

     n is a statement identifier,
     i is a simple integer,
     and $m_1$, $m_2$, $m_3$ are simple integer variables.

The syntax definition, however, is not sufficient to specify a
properly formed and meaningful DO statement. It is also necessary
to discuss the semantics of the DO statement. The semantics could
be given as follows:

     $V(m_2) > V(m_1) > 0$ and $V(m_3) > 0$

where $V(m_i)$ is the value at execution of the variable or constant
$m_i$. It is also necessary that n refer to a statement not previously
defined in the subprogram [1].

In  general,  the  syntax  and  semantics  of  an  entire  language,  (i.e., 
of  each  statement  in  the  language),  are  given  in  order  to  specify  the 
proper  construction  of  meaningful  statements  in  that  language. 

To  formalize  the  definitions  in  the  metalanguage,  each  definition 
is  given  the  form  of  a  statement  or  construct,  sometimes  called  a 
production  [3].   Thus  the  syntactic  definition  of  an  "unsigned  integer" 
could be:

     An unsigned integer can be formed by writing
          a digit, or
          an unsigned integer followed by a digit.

This definition is also an example of a recursive definition. That
is, the term "unsigned integer" is defined in terms of itself.

Syntactic definitions can frequently be shortened by using a
metalanguage symbolism. The following symbols will be used:

     SYMBOL     INTERPRETATION

     < x >      angular brackets; a syntactic category, the
                object named x.
     ::=        "can be written as" or "is defined as".
     |          reads as "or".

To repeat the recursive definition of an unsigned integer using
the above notation:

     < unsigned integer > ::= < digit > |
                              < unsigned integer > < digit >

This method of syntactic definition of a language is called
"specification by Backus-normal form," and abbreviated B.N.F.
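
The recursive definition of an unsigned integer can be followed almost word for word in a program. The sketch below is only an illustration in Python (the thesis itself uses PL/1); the function name `is_unsigned_integer` and the choice of the ten decimal digits for < digit > are assumptions made for the example.

```python
DIGITS = set("0123456789")   # assumed expansion of < digit >

def is_unsigned_integer(s: str) -> bool:
    """< unsigned integer > ::= < digit > | < unsigned integer > < digit >"""
    if len(s) == 1:
        # First alternative: a single digit.
        return s in DIGITS
    if len(s) > 1:
        # Second alternative: an unsigned integer followed by a digit.
        return s[-1] in DIGITS and is_unsigned_integer(s[:-1])
    # The empty string matches neither alternative.
    return False

print(is_unsigned_integer("1970"))
print(is_unsigned_integer("19a0"))
```

The recursion mirrors the production: the term being defined reappears on the right, and each recursive call works on a string one symbol shorter, so the test always ends.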

The  formalism  for  the  semantic  discussion  of  a  particular  language 
is  not  readily  available  or  apparent.   Although  attempts  have  been 
made  to  make  this  formalization  [1],  it  is  not  of  concern  here  since 
the  universal  syntax  checker  verifies  the  syntax,  not  the  semantics 
of  a  particular  language. 


Note on the name: the form is not actually a normal form, and B.N.F.
is often called Backus-Naur form. Naur introduced the actual notation,
and Backus introduced the concept [2].

B.   GRAMMAR  AND  LANGUAGE

As  mentioned  earlier,  the  universal  syntax  checker  accepts  the 
definition  of  a  language  and  programs  written  in  the  language. 

The  second  input  to  the  checker,  (a  program  written  in  the 
described  language),  necessitates  a  discussion  of  grammar. 

Mention of the word "grammar" brings to mind many thoughts. Most
people relate the word to something learned in school, or think of a
set of words and rules which describe some language. Webster defines
grammar in the following manner:

     The science treating of the classes of words, their
     inflections, and their syntactical relations.

Most of us recall dissecting and diagramming sentences to learn the
syntactical relation of the parts of a sentence. This is known as
parsing. For example, the sentence "The little girl talks fast." may
be dissected after a syntactic definition of a small subset of
English.

     1   < sentence >       ::=  < noun phrase >  < verb phrase >
     2   < noun phrase >    ::=  < adjective >  < noun phrase >  |
     3                           < adjective >  < singular noun >
     4   < verb phrase >    ::=  < singular verb >  < adverb >
     5   < adjective >      ::=  the  |
     6                           little
     7   < singular noun >  ::=  girl
     8   < singular verb >  ::=  talks
     9   < adverb >         ::=  fast

Figure 2-1.   Constructs of a sentence

Through the application of the syntax rules the sentence is parsed.
The  results,  when  diagrammed,  yield  a  syntax  tree.   Figure  2-2  shows 
the  diagram  of  this  particular  syntax  tree. 

                             < sentence >
                            /            \
             < noun phrase >              < verb phrase >
              /           \                /            \
     < adjective >   < noun phrase >   < singular verb >  < adverb >
          |            /         \            |               |
          |   < adjective >  < singular noun >|               |
          |        |               |          |               |
         The     little          girl       talks           fast

Figure 2-2.   Diagram of the sentence
"The little girl talks fast."

The application of the rules in Figure 2-1 to the sentence "The
little girl talks fast." reveals that the sentence is grammatically
correct. One should observe that sentences can not only be tested
but may also be generated by the application of the same rules. Thus
application of rule 1 followed by as many applications of rule 2 as
desired, and so on until no further application of rules is possible,
would also yield a grammatically correct sentence. The sentence might
not mean anything, but it would be grammatically correct. The semantics
of a language prevent improper statement formulation. For example,
the application of rule 1, followed by the application of rule 2 twice,
would yield the sentence, "The the little girl talks fast," which is
grammatically correct, but nonsensical just the same.
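
The test described above can be sketched in a few lines of Python. This is a minimal illustration, not the thesis's PL/1 program: the dictionary layout and the names `GRAMMAR`, `match`, and `is_sentence` are assumptions, and only the productions of Figure 2-1 are taken from the text.

```python
# Productions of Figure 2-1: each non-terminal maps to its alternatives,
# listed in the order the figure numbers them.
GRAMMAR = {
    "sentence":      [["noun phrase", "verb phrase"]],
    "noun phrase":   [["adjective", "noun phrase"],
                      ["adjective", "singular noun"]],
    "verb phrase":   [["singular verb", "adverb"]],
    "adjective":     [["the"], ["little"]],
    "singular noun": [["girl"]],
    "singular verb": [["talks"]],
    "adverb":        [["fast"]],
}

def match(symbol, words, pos):
    """Return every position at which `symbol` can end when it starts at pos."""
    if symbol not in GRAMMAR:
        # Terminal symbol: it must equal the next input word.
        if pos < len(words) and words[pos] == symbol:
            return {pos + 1}
        return set()
    ends = set()
    for alternative in GRAMMAR[symbol]:
        positions = {pos}
        for part in alternative:
            positions = {e for p in positions for e in match(part, words, p)}
        ends |= positions
    return ends

def is_sentence(text):
    words = text.lower().rstrip(".").split()
    return len(words) in match("sentence", words, 0)

print(is_sentence("The little girl talks fast."))
print(is_sentence("The the little girl talks fast."))
print(is_sentence("The girl talks."))
```

The second call is accepted for the reason given in the text: the rules check form only, so a nonsensical but well-formed sentence passes.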

To formalize the grammar presented, four ingredients were necessary.
First, the ingredient of syntactic categories, the noun phrase, adjective,
etc., from which strings of words were derived. These syntactic
categories are called non-terminal symbols. Second, there were the
words from which the sentences could be constructed. These objects
are called terminal symbols. Third, the relations between the various
strings of terminal and non-terminal symbols, which are called
productions. These productions are the rules of syntax, as shown in
Figure 2-1. And last, there was one distinguished non-terminal
symbol, which appeared nowhere on the right of any production rule.
This distinguished symbol is referred to as the start symbol [4], or
the goal symbol [2]. The non-terminal symbol "sentence" was the goal
symbol in the example given.

Thus a grammar G may be defined as $G = (V_n, V_t, P, Z)$. The symbols
$V_n$, $V_t$, $P$, and $Z$ represent, in order, the set of non-terminal
symbols, the set of terminal symbols, a set of productions, and the
goal symbol.

Let $V_n^*$, $V_t^*$, $V^*$ be finite concatenations of symbols from
$V_n$, $V_t$, and $V_n \cup V_t$ respectively. These concatenations are
called strings.

Let $\overset{*}{\underset{G}{::=}}$ represent the application of a
finite sequence of productions from G. $\beta \in V_t^*$ is said to be
a "derivation" of $\alpha \in V_n$ if and only if there exists a
sequence of productions such that $\alpha \overset{*}{::=} \beta$.

Let G be the subset of English defined earlier. Letting $\alpha$ =
< sentence >, $\beta$ = "The little girl talks fast.", the parsing tree
of Figure 2-2 shows that
< sentence > $\overset{*}{\underset{G}{::=}}$ "The little girl talks fast."

A language L(G) is defined as
$\{ w \mid w \in V_t^* \text{ and } Z \overset{*}{\underset{G}{::=}} w \}$ [4].

That  is,  a  string  is  in  L(G)  if  and  only  if: 

     1.  the string consists of a finite string of terminal symbols,
         and

     2.  the string can be derived from Z by the application of a
         finite sequence of productions from the grammar G.

The  set  of  finite  strings  of  terminal  symbols  defined  by  a  grammar  G 
is  called  the  set  of  final  sentential  forms. 

Now the additional input to the universal syntax checker may be
firmly identified. The statement, "and a program written in the
described language," means that the syntax checker is provided a
string (program) which is a member of $V_t^*$.

Thus to summarize, the universal syntax checker receives the
following two inputs:

     1.  A description of a language L in B.N.F. (The productions,
         non-terminal symbols, and terminal symbols.)

     2.  A finite string of terminal symbols from the described
         language. (An element of $V_t^*$.)

A formal description of the universal syntax checker is now
possible.

Given the B.N.F. of a language L, and a finite string of terminal
symbols from the described language (an element of $V_t^*$), the
universal syntax checker will determine whether that string is, or is
not, a member of L(G) for the language L defined.

C.   RELATED  RESEARCH 

The use of "syntax-directed" techniques is not new. These
techniques have been used in constructing natural language translators
[5], compilers [6,7], automatic programming language translators [8,
9, 10], and context-free grammar recognizers [11].

Syntax-directed analysis of natural languages is an unsolved problem
due primarily to the inability to precisely define the syntax of each
language. An excellent survey of the techniques employed can be found
in Bobrow [5].

Griffiths and Petrick [12] describe many types of recognition
procedures, all syntax-directed, for context-free grammars. They
employ Turing Machine algorithms to highlight the various methods.
Turing machines were used to preserve clarity and conciseness, and to
allow comparisons of procedures on the same level of complexity. The
universal syntax checker more closely approaches the selective top to
bottom (STB) method than any other described.

Irons [6] develops a syntax-directed ALGOL 60 compiler. The
arrangement of the various tables that contain the specifications of
the language is very similar to that employed in the universal syntax
checker. Each specifies entries for all syntactic units, as integers,
in the described language, with a one-to-one correspondence between the
actual symbol vector and the integer representation of that description.
The two methods do differ in that Irons uses a bottom-to-top method;
hence his need for a precedence matrix to ensure that the longest
string possible is accepted. The syntax checker instead requires
reordering of the B.N.F. to present the longest string possible before
any shorter string.

Feldman and Gries discuss the pushdown stack method [3] as one
of the two ways to achieve a parsing process. This same method is
an integral part of the "cellar principle" used to design a syntax
controlled generator of formal language processors [13]. The cellar
principle is based on sequential processing of the input symbols.
The syntax checker also employs a sequential processing of symbols, and
the recursive procedures available in PL/1 incorporate the pushdown
stack implicitly for all variables.

Barnett and Futrelle present an account of the SHADOW language
that is used to describe a syntax of a language and an associated
subroutine which parses an input string [14]. The method requires
that a mnemonic argument, in addition to the input string, be provided
to the SHADOW subroutine. These arguments include the names of arrays
which contain the string and the syntax. Thus, if a parse of a rational
fraction is desired, the mnemonic RATFRN is a required input argument.
The appearance of this mnemonic, requesting a syntactic analysis,
causes the subroutine to use the most recently read requested pattern
name and input string. This system is obviously unsuited for the
time-sharing environment due to the requirement that the programmer be
familiar with the syntax of the language in use.

Unger presents a Global Parser (GP) for phrase structured
grammars  [11].   The  method  he  employs  is  also  very  close  to  the  method 
used  in  the  universal  syntax  checker.   Some  major  differences  do  exist. 
The  primary  difference  is  his  use  of  a  set  of  routines  for  determining 
possible  prefixes  and  suffixes  of  N-derivable  strings,  finding  the 
minimum  length  of  such  strings,  and  sub-strings  that  can  never  appear 
in  N-derivable  strings.   Another  difference  is  the  method  of  comparison 
between  the  production  and  input  string.   Unger  claims  that  by  matching 
all  the  terminal  symbols  of  the  intermediate  string  against  the  input 
string  and  constantly  partitioning  the  input  string,  a  broad  class  of 
checks  can  be  made  to  terminate  fruitless  paths  quickly.   This  parser 
will  not  handle  cyclic  definitions. 

An error correcting parse algorithm is described by Irons utilizing
a syntax-directed scheme [15]. The algorithm provides two services.
First, the algorithm provides a parse of strings written in the language
described. Second, if an incorrect string is presented, the algorithm
will make substitutions, insertions, and deletions to make the object
string syntactically correct. The approach is different from the
syntax checker in that all possible parses are carried along until
one can be determined to be correct. Backtracking and recursive
productions are not allowed. Recursive productions are replaced by a
similarly powerful definition, which allows iterations in pairs of
symbols.

Most authors agree that there are several advantages and
disadvantages to syntax-directed procedures. Among the advantages are
the  simplicity  of  compiler  construction  and  the  ability  to  change  the 
specifications  of  a  particular  language.   In  addition  languages  may  be 
switched  by  merely  changing  the  contents  of  the  syntax  table.   Further, 
the  syntax-directed  parser  can  take  into  account  as  large  a  context  as 
is  required  to  perform  the  parsing.   The  disadvantages  include  the 
difficulty  in  trying  to  specify  the  syntax  of  a  language,  the  fact 
that  syntax-directed  compilers  contribute  little  toward  the  generation 
of  optimum  code,  and  generally  poor  error  analysis  and  recovery.   Error 
type  determination  is  nearly  impossible. 

D.   PARSING  METHOD 

It  has  been  shown  that  taking  a  string  of  symbols  and  a  grammar, 
and  constructing  a  derivation  of  the  string  to  form  its  syntax  tree, 
is  called  parsing,  recognizing,  or  analyzing. 

There are two basic types of parsing methods: top-down and bottom-up.
The bottom-up method will not be covered but is explained in [3]. The
top-down parser gets its name from the fact that it is goal oriented.
The top-down parser starts with the most global production (goal symbol)
and works its way down the productions, attempting to match the input
string. Each method can be further qualified as left-right or
right-left, depending on the order of processing the symbols.

In order to discuss the manner employed by the syntax checker to
accomplish its assigned tasks, consider the following example of a
language and a method to parse such a language:

1.   Grammar

Given the following grammar:

     < Z >  ::=  ⊥ < a > ⊥

     < a >  ::=  < b >  |
                 < a > + < b >

     < b >  ::=  < c >  |
                 < b > - < c >

     < c >  ::=  T  |
                 ( < a > )

Non-terminal symbols:  Z, a, b, c.
Terminal symbols:  T, ⊥, +, -, (, ).

2.   Derivations

To discuss ambiguity it is necessary first to discuss
derivations. Observe that the following string has more than one
derivation. Thus, < c > + < b > - < c > can be derived by two sets
of productions:

     < Z >  ::=  ⊥ < a > ⊥  ::=  < a > + < b >  ::=  < b > + < b >
            ::=  < c > + < b >  ::=  < c > + < b > - < c >     and,

     < Z >  ::=  ⊥ < a > ⊥  ::=  < a > + < b >  ::=  < a > + < b > - < c >
            ::=  < b > + < b > - < c >  ::=  < c > + < b > - < c >
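
The two derivations can be replayed mechanically to confirm that they apply only productions of the grammar and end in the same string. The sketch below is an illustration in Python; the names `PRODUCTIONS` and `rewrite` are assumptions, and, like the derivations as written above, it starts from the form < a > + < b >.

```python
# Right parts allowed for each non-terminal of the example grammar.
PRODUCTIONS = {
    "<a>": [["<b>"], ["<a>", "+", "<b>"]],
    "<b>": [["<c>"], ["<b>", "-", "<c>"]],
    "<c>": [["T"], ["(", "<a>", ")"]],
}

def rewrite(form, index, replacement):
    """Replace the symbol at `index` by `replacement`, if that is a production."""
    assert replacement in PRODUCTIONS[form[index]], "not a production of the grammar"
    return form[:index] + replacement + form[index + 1:]

start = ["<a>", "+", "<b>"]

# First derivation: the left operand is expanded first.
d1 = rewrite(start, 0, ["<b>"])               # <b> + <b>
d1 = rewrite(d1, 0, ["<c>"])                  # <c> + <b>
d1 = rewrite(d1, 2, ["<b>", "-", "<c>"])      # <c> + <b> - <c>

# Second derivation: the right operand is expanded first.
d2 = rewrite(start, 2, ["<b>", "-", "<c>"])   # <a> + <b> - <c>
d2 = rewrite(d2, 0, ["<b>"])                  # <b> + <b> - <c>
d2 = rewrite(d2, 0, ["<c>"])                  # <c> + <b> - <c>

print(" ".join(d1))
print(d1 == d2)
```

Both sequences use the same three productions and differ only in the order of application, which is why they describe one syntax tree rather than two.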
3.   Syntax Trees

The construction of the syntax tree is now shown:

                    < Z >
                      |
          < a >       +       < b >
            |               /   |   \
          < b >        < b >    -    < c >
            |            |             |
          < c >        < c >           |
            |            |             |
            T            T             T

Although there may be more than one set of productions
which yields the same string of symbols, their syntax trees are the
same. Ambiguity exists only when one or more strings have more than
one syntax tree.

Observe that the syntax tree shown below would result if the
string < c > + < b > - < c >, not a member of L(G), is parsed by
proceeding left at every junction down the tree (applying the
leftmost derivation).

                    < Z >
                      |
                ⊥   < a >   ⊥
                      |
                    < b >
                      |
                    < c >
                      |
                      T

4.   Procedure for a top-down left-right parser

To acquire familiarity with the top-down left-right parser,
consider the following procedure. The top-down left-right parser makes
an initial assumption that the string presented is a valid final
sentential form, and then proceeds as follows:

     step 1.  Establish goal symbol, in this case Z.
     step 2.  Apply a production to the goal or subgoal.
     step 3.  First symbol of resultant string a terminal?
              Yes, continue; No, go to step 6.
     step 4.  Terminal symbol and input string match?
              Yes, continue; No, go to step 7.
     step 5.  More symbols in string of production?
              Yes, continue; No, done.
     step 6.  Establish subgoal and go to step 2.
     step 7.  An alternate, (OR), production?
              Yes, go to step 6; No, done.

A slightly modified form of the above procedure will recognize
grammars with left, right, and self-embedded recursion. In addition,
ambiguous grammars are processed such that when two or more syntax
trees are available, the first, as determined by its placement in the
B.N.F., will be recognized.

This simplified procedure has been provided to acquaint the
reader with the general procedure of top-down left-right parsing.
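
The general shape of a top-down left-right recognizer for the example grammar can be sketched as follows. This is an illustration in Python, not the thesis's PL/1 algorithm, and the way it copes with the left-recursive productions is an assumption of the sketch: a branch is abandoned as soon as more symbols remain to be matched than input symbols are left. That rule is sound only for grammars with no empty productions and no cycles of single-symbol productions, which holds for the grammar above. The names `GRAMMAR`, `recognize`, and `derive` are hypothetical.

```python
from functools import lru_cache

END = "\u22a5"   # the end marker of the example grammar

GRAMMAR = {
    "Z": [[END, "a", END]],
    "a": [["b"], ["a", "+", "b"]],
    "b": [["c"], ["b", "-", "c"]],
    "c": [["T"], ["(", "a", ")"]],
}

def recognize(tokens, goal="Z"):
    """Top-down, left-to-right recognition with backtracking."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def derive(pending, pos):
        # `pending` holds the symbols still to be matched, leftmost first.
        if len(pending) > len(tokens) - pos:
            return False                    # more symbols than input left
        if not pending:
            return pos == len(tokens)       # succeed only if input is used up
        first, rest = pending[0], pending[1:]
        if first not in GRAMMAR:
            # Terminal symbol: it must match the current input symbol.
            return tokens[pos] == first and derive(rest, pos + 1)
        # Non-terminal: make it the subgoal and try each alternative in turn.
        return any(derive(tuple(alt) + rest, pos) for alt in GRAMMAR[first])

    return derive((goal,), 0)

print(recognize([END, "T", "+", "T", "-", "T", END]))
print(recognize([END, "T", "+", END]))
```

Unlike the procedure in the text, which accepts the first alternative that fits, this sketch tries every alternative before giving up, so the order of the alternatives affects only the amount of work done, not the answer.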

III.   SUPPORT  ROUTINES

A.  DISCUSSION 

The previous chapters presented the schema of the universal syntax
checker. Included was an explanation of the input requirements of the
system as well as a simplified method to accomplish the syntax checking
task.

Before  a  detailed  discussion  of  the  actual  parsing  algorithm  can 
begin,  it  is  necessary  to  describe  certain  grammar  manipulations  and 
tables  required  to  support  the  algorithm. 

B.  TABLES 

The  manner  in  which  the  grammar  is  placed  into  the  supporting 
tables  is  now  presented.   Note  that  the  identifiers  enclosed  in  the 
parentheses  immediately  following  the  table  titles,  are  the  actual 
identifiers  which  were  used  in  the  coding  of  the  universal  syntax 
checker  program. 

1.   B.N.F.  Table  (P) 

This  table  contains  the  definitions  of  the  B.N.F.  in  the  form 
<  identifier  >  : :=  <  letter  >  |  <  identifier  >  <  letter  >,  and  appears 
in  the  table  as  shown  in  Figure  3-1,  where  i  is  the  number  of  symbols 
in  the  production  and  j  is  the  number  of  productions  describing  the 
language.   The  example  is  taken  from  the  first  grammar  listed  in  the 
computer  program  listing. 

P
            1              2              3          . . .   i

     1      IDENTIFIER     LETTER
     2      IDENTIFIER     IDENTIFIER     LETTER
     3      LETTER         A
     4      LETTER         B
     5      LETTER         C
     .
     .
     j

Figure 3-1.   The B.N.F. table.

2.   Symbol Table (SYMTAB)

The symbol table is constructed from the grammar placed in
the B.N.F. table. This symbol table contains all the elements of $V_n$
followed by all the elements of $V_t$. The first entry in the table is
the empty string.

Since no terminal symbol may appear as the left part of a
production, the non-terminal symbols were located by examining the
first column of the B.N.F. table and applying the following logic:

     step 1.  Is the symbol being examined listed in the
              symbol table?
              Yes, examine next symbol; No, go to step 2.
     step 2.  Put the symbol in the symbol table, increment
              the table pointer, and examine the next symbol.

After noting the location of the last non-terminal symbol, the
terminal symbols are placed into the table. Since no terminal
symbol may appear on the left of a production, these symbols are
determined using the same logic as before, excluding the first column
and examining the remaining columns of the B.N.F. table. Referring to
the grammar shown in Figure 3-1, Figure 3-2 shows the symbol table upon
completion of its construction. The identifier, nsymbols, refers to
the total number of symbols in the grammar.
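
The two passes over the B.N.F. table can be sketched as follows. The code is a Python illustration, not the thesis's PL/1 routine; the names `P`, `build_symtab`, and `last_nonterminal` are assumptions, and the table holds only the five productions shown in Figure 3-1. Python lists are indexed from 0, whereas the thesis numbers its table entries from 1.

```python
# Rows of the B.N.F. table: left part first, then the symbols of the right part.
P = [
    ["IDENTIFIER", "LETTER"],
    ["IDENTIFIER", "IDENTIFIER", "LETTER"],
    ["LETTER", "A"],
    ["LETTER", "B"],
    ["LETTER", "C"],
]

def build_symtab(p):
    symtab = [""]                         # the first entry is the empty string
    for row in p:                         # pass 1: first column gives non-terminals
        if row[0] not in symtab:
            symtab.append(row[0])
    last_nonterminal = len(symtab) - 1    # 0-based position of the last non-terminal
    for row in p:                         # pass 2: remaining columns give terminals
        for symbol in row[1:]:
            if symbol not in symtab:      # non-terminals are already listed
                symtab.append(symbol)
    return symtab, last_nonterminal

symtab, last_nonterminal = build_symtab(P)
print(symtab)
print(last_nonterminal)
```

Because the second pass skips anything already in the table, non-terminals that occur in right parts are not entered twice, and every symbol found only in the second pass is a terminal.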

3.   Onright Table (ONRIGHT)

There is a one-to-one correspondence between the subscript
numbers of the symbol table and the subscript numbers of the onright
table.

SYMTAB

     1          (empty string)
     2          IDENTIFIER
     3          LETTER
     .          .
     40         (last non-terminal)
     41         A
     42         B
     43         C
     .          .
     nsymbols   (last symbol)

Figure 3-2.   The Symbol table.

Thus, ONRIGHT (i) is marked "true" if and only if the symbol in
SYMTAB (i) appears in the right part of some production. A search
of the "false" entries in the onright table produces the goal symbol.
More than one "false" entry indicates that the goal symbol is not
unique and therefore the grammar is unacceptable. Figure 3-3 depicts
the onright table for an acceptable grammar.
ONRIGHT

subscript   entry
1           True
2           False
3           True
4           True
.           .
nsymbols    True

Figure 3-3.   The Onright table.
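
The search for the goal symbol can be sketched in Python as follows. This is only an illustration: the function name goal_symbol, the small grammar with a PROGRAM symbol, and the decision to skip subscript 1 (the empty string) during the search are assumptions, not details taken from the PL/1 program.

```python
def goal_symbol(productions, nsymbols):
    """productions: rows of 1-based symbol subscripts, left part first."""
    onright = [False] * (nsymbols + 1)     # slot 0 unused, keeps subscripts 1-based
    for row in productions:
        for sym in row[1:]:                # every symbol in a right part
            onright[sym] = True
    # Assumption: subscript 1 (the empty string) is never a goal candidate.
    unmarked = [i for i in range(2, nsymbols + 1) if not onright[i]]
    if len(unmarked) != 1:
        raise ValueError(f"goal symbol is not unique: {unmarked}")
    return unmarked[0]


# Hypothetical table: 2 PROGRAM, 3 IDENTIFIER, 4 LETTER, 5 A, 6 B.
EXAMPLE_PR = [[2, 3], [3, 4], [3, 3, 4], [4, 5], [4, 6]]
goal = goal_symbol(EXAMPLE_PR, 6)
```

A grammar in which two symbols never appear in a right part raises the error, which corresponds to the "unacceptable grammar" case in the text.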
4.   Production  Table  (PR) 

There  is  a  one-to-one  correspondence  between  the  production 
table  (PR)  and  the  B.N.F.  table  (P).   Also,  there  exists  an  onto  mapping
from  the  entries  in  the  production  table  to  the  subscript  numbers  of 
the  symbol  table.   Thus,  wherever  the  symbol  in  SYMTAB  (i)  appears 
in  the  B.N.F.  table,  the  entry  "i"  is  made  in  the  production  table. 
For  example,  consider  the  symbol  table  in  Figure  3-2.   Wherever  the 
symbol  IDENTIFIER  appears  in  the  B.N.F.  table  shown  in  Figure  3-1, 
an  entry  "2"  is  made  in  the  production  table  as  shown  in  Figure  3-4. 
Note  the  recursive  production  occurring  in  the  second  row. 

PR

              column
row     1     2     3     4     5
1       2     3
2       2     2     3
3       3     41
4       3     42
5       3     43
.

Figure 3-4.   The Production table.
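
The mapping from the B.N.F. table to the production table can be sketched in Python as follows. The function name build_production_table and the reduced grammar are hypothetical; with only two non-terminals the letter A receives subscript 4 here, whereas in Figure 3-2 it receives subscript 41.

```python
def build_production_table(bnf, symtab):
    """Replace every symbol of the B.N.F. table by its 1-based
    subscript in the symbol table."""
    index = {name: i + 1 for i, name in enumerate(symtab)}
    return [[index[sym] for sym in row] for row in bnf]


# Illustrative inputs, not the grammar of Figure 3-1.
EXAMPLE_SYMTAB = ["", "IDENTIFIER", "LETTER", "A", "B", "C"]
EXAMPLE_BNF = [
    ["IDENTIFIER", "LETTER"],
    ["IDENTIFIER", "IDENTIFIER", "LETTER"],
    ["LETTER", "A"],
    ["LETTER", "B"],
    ["LETTER", "C"],
]
pr = build_production_table(EXAMPLE_BNF, EXAMPLE_SYMTAB)
```

The second row of the result begins with the same subscript twice, which is how a left recursive production shows up in the table.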
IV.   THE PARSING ALGORITHM

A.  BACKGROUND 

The inputs, general parsing procedure, and support tables of the
universal syntax checker have been described. A presentation of the
actual programming implementation remains. This implementation was
accomplished utilizing the recursive procedures available in the PL/1
programming language. The operation of the major procedure, called
RECOGNIZE, is dependent upon a symbol buffer and an accumulator. The
buffer contains a portion of the input string. The accumulator
provides symbol back-up and is the heart of the procedure NEXTSYM.
These two procedures, NEXTSYM and RECOGNIZE, will be discussed briefly
before a description of the algorithm is given in reference ALGOL [16].

B.   PROCEDURES
1.   Nextsym

The NEXTSYM procedure centers around the accumulator shown
in Figure 4-1. Each element of the accumulator contains an entry
"i" corresponding to SYMTAB (i) for each symbol recognized in the
buffer.

Associated  with  the  accumulator  are  three  pointers:   ap,  tap, 
and  acclen.   The  accumulator  pointer,  ap,  points  to  the  symbol  being 
examined.   The  temporary  accumulator  pointer,  tap,  retains  the  value 
of  the  accumulator  pointer  at  each  junction  in  the  syntax  tree  that 
the  procedure  examines.   The  accumulator  length  pointer,  acclen, 
retains  the  total  number  of  accumulator  positions  filled. 
[Diagram: the accumulator array, with the pointers ap and acclen marked beneath it.]

Figure 4-1.   Accumulator

The accumulator will hold the first fifty symbols from the input
string. Thereafter, it will contain the most recent forty to fifty
symbols. The variable "offset" serves to adjust the subscript number
i of the accumulator while allowing the accumulator pointer to increase
sequentially.
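
The behaviour of the accumulator can be sketched in Python as follows. This is a minimal illustration, assuming a window of fifty positions that slides by ten when it is full; the class name Accumulator, the scan callable that supplies the next symbol subscript, and the use of an exception for the "depth of search exceeded" case are all hypothetical.

```python
class Accumulator:
    def __init__(self, scan, capacity=50, shift=10):
        self.scan = scan                    # callable returning the next symbol subscript
        self.capacity = capacity
        self.shift = shift
        self.cells = [0] * (capacity + 1)   # positions 1..capacity; slot 0 unused
        self.ap = 0                         # number of symbols accepted so far
        self.acclen = 0                     # number of positions filled
        self.offset = 0                     # symbols already discarded from the front

    def next_sym(self, k):
        """Compare symbol k with the next input symbol; advance ap on a match."""
        i = self.ap + 1 - self.offset       # position inside the window
        if i < 1:
            raise LookupError("depth of search exceeded")
        while i > self.acclen:
            if self.acclen == self.capacity:
                # Window full: drop the oldest symbols and slide the rest down.
                keep = self.capacity - self.shift
                self.cells[1:keep + 1] = self.cells[self.shift + 1:self.capacity + 1]
                self.offset += self.shift
                self.acclen -= self.shift
                i -= self.shift
            else:
                self.acclen += 1
                self.cells[self.acclen] = self.scan()
        if self.cells[i] == k:
            self.ap += 1
            return True
        return False
```

Backing up is done by resetting ap to an earlier value. The back-up succeeds as long as the target symbol is still inside the window, which is why only the most recent forty to fifty symbols can be re-examined.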

2.   Recognize

The RECOGNIZE procedure is a top-down left-right slow-back
parser. The method, as described in Chapter II, has been implemented
with one exception: the parser attempts to recognize the left-most
derivation first. The B.N.F. presented to the universal syntax checker
must be modified so that the left-most derivation is the longest string
possible (see Appendix A).

C.   THE  SYNTAX  CHECKING  ALGORITHM 

The description of the syntax checking algorithm follows (reference
ALGOL). The ALGOL version of the algorithm has not been run on
a computer, although the PL/1 version presented in the computer
program section has been tested, and the results are in the computer
output section.
procedure MAIN;

integer bp, ap, acclen, offset, i, j, k, npr, nonterm,
        nsymbols, bl, linecount, target, scan, a;

array Symtab [1:nsymbols], Buffer [1:80], Accum [0:50],
      PR [1:n, 1:8];

boolean array Onright [1:nsymbols];

boolean change, empty, done, check;

comment   This  procedure  is  a  top-down  left-right  slow-back  parser. 
To  recognize  a  final  sentential  form  of  the  B.N.F.  presented  to 
the  procedure,  the  distinguished  symbol  is  established  as  the 
goal  symbol,  and  productions  are  applied,  where  valid,  in  the 
order  presented.   Once  a  terminal  symbol  is  encountered,  the 
symbol  is  matched  with  the  next  symbol  in  the  input  string. 
If  the  match  is  successful,  an  attempt  is  made  to  match  the 
next  symbol  of  the  string,  and  so  on.   If  a  match  is  not 
found,  then  an  alternate  production  is  attempted  in  an  effort 
to  recognize  the  symbol  in  the  buffer.   On  completion  of  the 
procedure,  the  input  string  will  be  declared  syntactically 
correct  or  incorrect; 

integer procedure LOOKUP (k);
value k; integer k;
comment Returns the row of the first production whose left part is k;
begin for j := 1 step 1 until npr do
        if PR [j, 1] = k then begin
          LOOKUP := j;
          go to EXIT end;
EXIT:   end LOOKUP;
boolean procedure NEXTSYM (k);
value k; integer k;
begin comment This procedure controls the accumulator and checks
  for recognition of the symbols in the buffer with the results
  of the application of production rules;
  integer i, m;
  i := (ap + 1) - offset;
  if i > acclen then begin
    for i := i while true do
      if i ≤ 50 then begin
        comment fill the accumulator up to position i;
        for j := acclen + 1 step 1 until i do begin
          accum [j] := SCAN;
          if check then write (symtab [accum [j]]); end;
        acclen := i;
        if accum [i] = k then begin
          ap := ap + 1;
          NEXTSYM := true; end
        else NEXTSYM := false;
        go to A; end
      else begin
        comment accumulator full, discard the ten oldest symbols;
        offset := offset + 10;
        for m := 1 step 1 until 40 do
          accum [m] := accum [m + 10];
        acclen := acclen - 10;
        i := i - 10; end
  end
  else if i < 1 then begin
    write ('depth of search exceeded');
    go to A; end
  else begin
    if accum [i] = k then begin
      ap := ap + 1;
      NEXTSYM := true; end
    else NEXTSYM := false;
  end;
A: end NEXTSYM;

boolean procedure RECOGNIZE (production, element);
value production, element;
integer production, element;
begin comment This procedure attempts to recognize each of the
  symbols in the input string. If a symbol is recognized then
  RECOGNIZE is set to true, otherwise false;
  integer k, lr, tap;
  lr := 0;
  if element ≤ 8 then tap := ap;
  if element = 1 then begin
    if PR [production, 1] = PR [production, 2] then
      if ¬ RECOGNIZE (production, 3) then begin
        ap := tap;
        RECOGNIZE := if PR [production, 1] = PR [production + 1, 1]
                     then RECOGNIZE (production + 1, 1)
                     else false;
        go to OUT;
      end
      else RECOGNIZE := true
    else if RECOGNIZE (production, 2) then begin
      tap := ap;
      if check then begin
        write (symbol found);
        for production := production while (PR [production, 1] =
            PR [production + 1, 1] ∧ PR [production + 1, 1] ≠
            PR [production + 1, 2]) do
          production := production + 1
      else end;
      if lr ≠ 0 then begin
        write (symbol, ' number of left recursive production found');
        ap := tap;
        RECOGNIZE := true;
      end
      else RECOGNIZE := false
      go to OUT
    else begin
      tap := ap;
      RECOGNIZE := if PR [production, 1] = PR [production + 1, 1] ∧
                      PR [production + 1, 1] ≠ PR [production + 1, 2]
                   then RECOGNIZE (production + 1, 1)
                   else false;
      go to OUT;
    end
  else begin
    k := PR [production, element];
    if k ≠ 1 then begin
      if k > nonterm then
        RECOGNIZE := if NEXTSYM (k) then
                       RECOGNIZE (production, element + 1)
                     else false
      else
        RECOGNIZE := if RECOGNIZE (LOOKUP (k), 1) then
                       RECOGNIZE (production, element + 1)
                     else false
    end
    else RECOGNIZE := true;
  end
  else RECOGNIZE := true;
OUT:
end RECOGNIZE;

if ¬ done then
begin if RECOGNIZE (1, 1) then write ('syntax ok')
      else write ('syntax error'); k := 0;
  for k := k while (k ≠ nsymbols ∧ ¬ done) do k := scan;
  bp := 72; ap := acclen := offset := linecount := 0;
end MAIN;
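
For readers who wish to experiment, the general idea of the recognizer can be sketched in Python as follows. This is a simplified illustration and not a transcription of RECOGNIZE or of the PL/1 program: it tries the alternatives of a non-terminal in the order given, accepts the first trivial production that matches, and then repeatedly extends the match with the left recursive productions. The names make_recognizer and syntax_ok, the unpadded rows of the production table, and the list of token subscripts standing in for the accumulator are all assumptions.

```python
def make_recognizer(pr, nonterm):
    """pr: productions as lists [left, right1, right2, ...] of symbol
    subscripts; subscripts greater than nonterm denote terminals."""

    def match_right_part(symbols, tokens, pos):
        # Match each symbol of a right part in turn, left to right.
        for sym in symbols:
            if sym > nonterm:                          # terminal symbol
                if pos < len(tokens) and tokens[pos] == sym:
                    pos += 1
                else:
                    return None
            else:                                      # non-terminal: recurse
                pos = recognize(sym, tokens, pos)
                if pos is None:
                    return None
        return pos

    def recognize(goal, tokens, pos):
        rows = [p for p in pr if p[0] == goal]
        plain = [p[1:] for p in rows if p[1] != goal]  # trivial productions
        tails = [p[2:] for p in rows if p[1] == goal]  # left recursive productions
        end = None
        for right in plain:                            # tried in the order presented
            end = match_right_part(right, tokens, pos)
            if end is not None:
                break
        if end is None:
            return None
        extended = True
        while extended:                                # extend while a tail still matches
            extended = False
            for tail in tails:
                nxt = match_right_part(tail, tokens, end)
                if nxt is not None and nxt > end:
                    end, extended = nxt, True
                    break
        return end

    return recognize


def syntax_ok(pr, nonterm, goal, tokens):
    """True if the whole token string is derived from the goal symbol."""
    return make_recognizer(pr, nonterm)(goal, tokens, 0) == len(tokens)


# Hypothetical tables: 2 IDENTIFIER, 3 LETTER, 4 A, 5 B, 6 C.
EXAMPLE_PR = [[2, 3], [2, 2, 3], [3, 4], [3, 5], [3, 6]]
```

With these tables, syntax_ok(EXAMPLE_PR, 3, 2, [4, 5, 6]) checks the string ABC against IDENTIFIER. Handling the left recursive rows in a loop, after a trivial production has matched, is the reason the trivial production must be presented first (see Appendix A).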
V.   CONCLUSIONS

A.      USES 

The algorithm presented has many applications. One application,
more suitable than the others, is use in a time-sharing environment.

In a time-sharing system, the universal syntax checker could reside
on a direct access storage device, along with the B.N.F. definitions,
to be called when desired. Each terminal user would be able to
time-share this syntax checker regardless of the language used. The
syntax checker would provide a very rapid syntax check for each user
at each terminal. As soon as the first syntax error was encountered,
syntax checking of that program would end. The user at an on-line
terminal would then examine the incorrect statement and effect a
correction, after which the syntax checker would restart the program
analysis. Figure 5-1 depicts a possible configuration for a
time-sharing system with a syntax checker.

[Diagram: several terminals connected to a computer, which is connected to a direct access storage device (DASD) holding the syntax checker.]

Figure 5-1.   Time-sharing system with syntax checker.
B.   FURTHER RESEARCH AND IMPROVEMENTS

One obvious need is the actual implementation of the universal
syntax checker in a time-sharing system.

Two major improvements are also needed. The first is a recovery
procedure which will allow recovery to a logical restart point, so that
syntax checking can continue after a syntax error is encountered. The
second is a more complete and precise set of diagnostic messages.
Simply to say "syntax ok" or "syntax error" is not enough. Precise
statements of the form where, what, and why are needed.
APPENDIX A

Backus-normal form requirements.

1.  Maximum  number  of  characters  per  symbol  is  ten. 

2.  Maximum  number  of  symbols  per  production  is  eight. 

3.  Maximum  number  of  productions  per  B.N.F.  is  three  hundred. 

4.  Left  recursive  productions  must  include  the  trivial  production 
prior  to  the  recursive  production. 

5.  Cyclic  productions  require  the  trivial  productions  as  in  4. 

6.  The  B.N.F.  must  be  ordered  to  present  the  longest  string  possible 
prior  to  any  other  production. 

7.  The character string $PROGRAM is not allowed in a production for
a language. This string is used as an indicator to mark the end
of the B.N.F. and the start of each program submitted for parsing.

8.  Place the characters $PROGRAM immediately after the B.N.F.
submitted and immediately after each program submitted for syntax
checking.
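
Requirement 6 can be illustrated with a small Python sketch. It assumes a parser that commits to the first alternative that matches, which is a simplification of the slow-back method; the function names first_match and accepts, and the two-alternative grammar, are hypothetical.

```python
def first_match(alternatives, text):
    """Length consumed by the first alternative that is a prefix of
    text, or None if no alternative matches."""
    for alt in alternatives:
        if text.startswith(alt):
            return len(alt)
    return None


def accepts(alternatives, text):
    """True if the first matching alternative consumes the whole text."""
    return first_match(alternatives, text) == len(text)


SHORT_FIRST = ["A", "AB"]    # shorter string presented first
LONG_FIRST = ["AB", "A"]     # longest string presented first, as required
```

With the shorter alternative first, the input AB is rejected because the parser commits to A and leaves B unmatched; with the longest alternative first, both AB and A are accepted.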